A machine learning ensemble approach for 5- and 10-year breast cancer invasive disease event classification

Designing targeted treatments for breast cancer patients after primary tumor removal is necessary to prevent the occurrence of invasive disease events (IDEs), such as recurrence, metastasis, contralateral and second tumors, over time. However, due to the molecular heterogeneity of this disease, predicting the outcome and efficacy of the adjuvant therapy is challenging. A novel ensemble machine learning classification approach was developed to address the task of producing prognostic predictions of the occurrence of breast cancer IDEs at both 5- and 10-years. The method is based on the concept of voting among multiple models to give a final prediction for each individual patient. Promising results were achieved on a cohort of 529 patients, whose data, related to primary breast cancer, were provided by Istituto Tumori “Giovanni Paolo II” in Bari, Italy. Our proposal greatly improves the performances returned by the baseline original model, i.e., without voting, finally reaching a median AUC value of 77.1% and 76.3% for the IDE prediction at 5-and 10-years, respectively. Finally, the proposed approach allows to promote more intelligible decisions and then a greater acceptability in clinical practice since it returns an explanation of the IDE prediction for each individual patient through the voting procedure.


Introduction
Breast cancer is one of the main causes of death worldwide [1]. For this reason, the management of breast cancer patients has always been a topic of great interest within the scientific community. Surgery of various types followed by adjuvant therapies, such as chemotherapy, hormonotherapy, radiotherapy alone or in combination, represents the primary breast cancer treatment [2,3]. Despite great progress has been made to improve cancer survival [4], the choice of which adjuvant therapy must be performed for preventing the occurrence of invasive disease events (IDEs) after the primary tumor, such as recurrence, metastasis, contralateral and second tumors, still remains challenging [5,6]. Clinical experts make their decisions for each single patient following relevant guidelines [2], after having collected and evaluated a series of measurements of clinical and histological parameters. So far, several research works have demonstrated the central role of breast cancer subtypes on both tumor prognosis and therapy efficacy [7,8]. The first level breast cancer subtype classification involves immunohistochemistry-based subtypes, namely, Luminal-like, HER2-positive and Triple Negative tumors [9]. According to the identified subtype, patients eligible for a specific adjuvant treatment following surgery can be selected, thus sparing some other patients from unnecessary and/or potentially toxic treatments. However, due to the molecular heterogeneity of this disease, predicting the outcome and the efficacy of the adjuvant therapy tailored for each individual patient is very hard. Tools based on molecular biomarkers, such as mRNA biomarkers or genes, have also developed [10], complementing traditional histopathology methods. However, they are expensive and not all centers are provided with laboratories performing these types of analyses. Within this emerging scenario, the need of predictive models which make a trade-off between reliable predictions of therapy results and cost-effectiveness is urgent. In recent years, due to great advancements in the field of artificial intelligence applied to biomedicine, the design and development of machine learning models to support clinical decisionmaking processes in breast cancer treatment, have been extensively investigated in the stateof-the-art [11,12]. Their best potential consists of learning data models automatically without a prior hypothesized knowledge about variable interactions. Interdependence and complex nonlinear relationships among clinical data can be recognized [13]. Several studies based on machine learning have attempted to predict breast cancer recurrence/metastasis by solving a classification task [14,15] or developing survival models [16]. The prediction of composite events, namely IDEs, can play a relevant role within the adjuvant clinical trial setting for breast cancer [17]. Indeed, it is known that chemotherapy treatments may cause second tumors [18]. However, there is a lack of research studies focusing on IDE prediction. The occurrence of IDEs in breast cancer patients has been recently predicted by a Chinese research team, only in terms of survival rates [19]. In this work, a novel ensemble machine learning approach has been developed to address the IDE prediction as a binary classification task: breast cancer patients for which IDEs have or have not occurred at both 5-and 10-year follow-ups were discerned. The respective classes of belonging were indicated as IDE and non-IDE. The model was based on voting among multiple models [20]. When a coherent prediction among the models was obtained for a given patient, the patient was finally classified into one of the two class, IDE or non-IDE, and a final classification score was assigned. Conversely, when no coherence was found among the multiple models, no answer was given for that patient. In this last case, the ensemble model did not express a decision for that patient. Data from a cohort of 529 patients referred to Istituto Tumori "Giovanni Paolo II" in Bari, Italy, were used to train and validate the proposed approach.

Experimental data
This study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Scientific Board of Istituto Tumori 'Giovanni Paolo II'-Bari, Italy. The number of the Protocol approved by the Scientific Board of Istituto Tumori 'Giovanni Paolo II' (Bari, Italy) was 6629/21). Written consent was not required from subjects, as it is retrospective study and involves minimal risk. All data were fully anonymized before analysis.
The experimental data analyzed in the current paper refer to a cohort of patients who were registered to Istituto Tumori "Giovanni Paolo II" in Bari, Italy, for a first breast tumor diagnosis in the period 1995-2019. Patients who underwent a neoadjuvant chemotherapy for breast cancer and/or had metastasis ab initio and/or had carcinoma in situ and/or followed up less than 10 years, while no clinical events have occurred in the meantime, were excluded from this study. A total of 28 features related to the primary breast tumor and the performed therapy was collected for 529 patients, who respected the eligible criteria. For those patients who were affected by bilateral or multiple tumors, the data referred to the greatest receptor expression were attributed. The data expressing the performed therapy were reported as in the following: chemotherapy (abbr. CT, values: Yes/No), hormone therapy (abbr. HT, values: Yes/No), trastuzumab (values: Yes/No), CT scheme (values: absent, Anthracycline (Anthra) + taxane, Anthra, taxane, CMF, other), HT scheme (values: absent, Tamoxifen (Tam), luteinizing hormone-releasing hormone analogues (LHRHa), Tam + LHRHa, aromatase inhibitor (AI), Tam + AI, LHRHa + AI, other), therapy combination (abbr. ther. comb, values: No, HT, CT, CT + HT, CT + trastuzumab, CT + HT + trastuzumab). Finally, information about CT duration expressed in months (abbr. CT months), time elapsed between the year of the first tumor diagnosis and surgery expressed in months (abbr. diag.-surg. months) and time elapsed between surgery and therapy initiation expressed in months (abbr. surg.-ther. months), were also collected. All the features' distributions are summarized in Table 1. Absolute values as well as percentage values are reported for categorical variables, whereas median values and first (q 1 ) and third (q 3 ) quartiles are specified for continuous values. Missing data are indicated as NA. Before data analysis, missing data of a given patient were replaced with the corresponding values of the patient with complete data whose feature vector had minimum Euclidean distance from the feature vector of the given patient [21].

Problem formulation
In this study, machine learning was used to predict the occurrence of invasive disease events at both 5-and 10-years in breast cancer patients who have shown a first infiltrating breast cancer. The term IDEs refer to composite clinical events, such as local and distant recurrence, contralateral invasive breast cancers, second primary tumors and death [17]. Two binary classification tasks to discriminate patients for which IDE have or have not occurred depending on whether the follow-up was fixed either at 5-years or at 10-years were formulated. The label IDE class indicated those patients for whom an event occurred within 5 or 10 years from the first tumor diagnosis date. A total of 142 patients of our database has shown an IDE within 5-years from the first tumor diagnosis, out of which 111 recurrence, 21 contralateral tumors and 10 second tumors, whereas 207 patients have shown an IDE within 10-years from the first tumor diagnosis, out of which 157 recurrence, 32 contralateral tumors and 18 second tumors. The label non-IDE class indicated the control cases defined as follow. For the 5-year IDE prediction model, patients with at least 10-year follow-up and without an event within that timeframe were considered as control cases (374 patients). Patients followed up less than 10 years, while no events were happened in the meantime, were not considered in the non-IDE class to avoid biases and noise, since they could develop an event shortly after 5-years. Under the same rationale, for the 10-year IDE prediction model, patients with at least 14-year follow up and without an event within that timeframe were identified as control cases (322 patients).

Machine learning ensemble approach rationale
A predictive approach to give a prognostic prediction of IDE occurrence separately at 5-and 10-year follow-ups was designed. Its architecture consists of a machine learning-based ensemble approach which exploits the concept of voting among multiple models [20]. In other words, the proposed approach combines, according to specific rules described in S1 File, the predictions of three baseline models (Fig 1), named as Model 1, Model 2, Model 3. The approach was developed and evaluated on the entire dataset according to a 10-fold cross validation scheme in order to observe the performance over the entire dataset. In this way, training and test sets were created ten times independently by means of random sampling stratified with respect to the real class of belonging (IDE class/non-IDE class). Each test set was composed by the 10% of the total number of patients. In the following, the ten test sets are referred as ten independent tests. For each of the ten times, Model 1 took in input all the collected 28 features of the training set. In Fig 1 it is named as original model since it exploits the raw data. By starting from the training set of Model 1, a so-called cleaning up procedure, was performed thus obtaining another two models, Model 2 and Model 3. The details about the cleaning up procedure are reported in S1 File. The procedure is represented in S1 Fig. Basically, cleaning up procedure is based on four standard machine learning algorithms which classified the patients of the training set as IDE or non-IDE patients. All the patients marked as "misclassified" by all the classifiers were considered as confounding samples. By removing these patients from the training set, a model without confounding patients, Model 2, was then created. On the other hand, all the neglected patients were collected to define a new model, Model 3, with confounding patients only. The rationale under the concept of confounding patients is better clarified in the first sub-section of the Results section. Each of the three models was based on the same inner functioning procedure. First, a nested feature selection via Boruta technique has been applied on the training set through a 20 5-fold cross validation scheme rounds. Then, the selected features on the training set were used to train a machine learning classifier, eXtreme Gradient Boosting (XGBoost) [22], which returned a classification score for each patient of the test set (named as score s1,s2,s3 for Model 1, Model 2, Model 3, respectively; see Fig 1). A patient of the test set was assigned by XGBoost classifier to the IDE-class if the predicted score exceeded a threshold, which was computed separately for each of the three models (and indicated as th1, th2, th3, respectively) as the ratio of the IDE-patients of the training set over the total amount of patients of the same set. As final step, an ensemble model was constructed by combining the classification scores obtained in correspondence of the three models according to the rules which are described in S1 File, and represented in S3 Fig. After the application of the rules, when a consensus among the three baseline models was obtained, the patient was finally classified into one of the two class, IDE or non-IDE, and a final classification score was assigned. Conversely, when no coherence among the three scores was judged by the rules, no answer was given for that patient. In this last case, the ensemble model did not express a decision for that patient. The rules at the basis of the ensemble model required the setting of 8 parameters that were computed from the distributions of scores returned by the three models on the ten training sets (see S1 File and S1 Table). A large set of trial solutions was examined by combining five trials for each of the parameters. A total of 5 8 = 390625 parameter combinations across the ten independent tests were explored. To the aim, the solutions was drawn from a regular grid in the model space: this procedure was called grid search. A more detailed explanation of the grid search procedure in S1 File. All the experimental simulations were implemented using the R Software (v. 4.1.1, R Foundation for Statistical Computing, http://www.r-project.org/) and MATLAB R2019a (MathWorks, Inc., Natick, MA, USA) software. The models' performance were evaluated in terms of the Area Under the receiver operating Curve (AUC) and, once the optimal threshold was identified by Youden's index on ROC curve [23], standard metrics, such as accuracy, sensitivity and specificity were also computed: Sensitivity ¼ TP=ðTP þ FNÞ ð2Þ

Confounding patients
Starting from the initial dataset, the classification tasks (at 5-and 10-years) were performed through Model 1, i.e., the original model taking in input the raw data. Model 2 was constructed from Model 1 by means of the implementation of the so-called cleaning up procedure based on four standard machine learning algorithms, such as XGB [22], Random Forest [26], Naïve Bayes [27] and SVM [28] (see S1 File). Confounding patients were identified as those patients marked as "misclassified" by all the classifiers. They were then excluded in the development of Model 2, whereas they were used to define Model 3. These patients were considered as confounding in model-independent way, since they appeared as indistinguishable with respect to the considered features for all the employed classifiers. Specifically, they presented similar clinical feature values to other patients of the database but belonged to opposite class. For this reason, it was not possible to define a unique model able to discern confounding patients to other patients. Fig 2 shows the consensus maps among the four classifiers computed in terms of Cohen's kappa (κ) coefficient [29] between pairs of classification scores (e.g., XGBoost vs RF) before and after the cleaning up procedure applied for the 10-year IDE prediction (Fig 2A and  2B, respectively). The κ coefficient values range in the interval [-1;1]. The higher the κ coefficient, the greater the consensus is. Overall, the consensus among the classifiers increased after the cleaning up procedure (brighter values within the color map), thus meaning that by excluding confounding patients, the characteristics among the remaining patients are more homogeneous. The same behavior has been observed for the 5-year IDE prediction, but it is omitted to not burden the discussion.

Performances of the baseline models composing the ensemble model
The ensemble model proposed here, relies on the concept of voting among three baseline models. Here, the main results achieved by the three baseline models separately were discussed. For each of the three models, the statistical frequency of the features selected by means Boruta technique over each of the ten training sets though a 20 5-fold cross validation rounds for the 5-year follow-up and 10-year follow-up are depicted in Fig 3A and 3B, respectively. The most important features for Model 1 at both 5-and 10 follow-ups (with a percentage statistical frequency equal to 100% for all the ten training sets) are ER and Ki67, whose role is well known to be crucial in the decision of what therapy path to perform [8] (upper panels of Fig 3A and  3B). Other important features, selected especially for the 5-year follow up, are grading, the sentinel lymph node, the number of both eradicated and metastatic lymph nodes, lymphadenectomy, which, in the state-of-the-art, have been proven to be prognostic factors of recurrence [30,31]. While grading reveals to be as an important feature for the 5-year follow up, in situ component, which has been recently recognized as a significant risk factor for intramammary recurrence [32], emerges as one of the most selected features for the 10-year follow-up. The number of selected features with a percentage statistical frequency of 100% increased for Model 2 at both 5-and 10-year follow-ups (central panels of Fig 3A and 3B). With respect to therapies, while HT scheme feature is mostly selected for the 5-year follow-up, the CT scheme is identified with a greater statistical frequency for the 10-year follow-up. The multiplicity feature emerges as important for the 5-year follow-up, whilst LVI and in situ component reach higher percentage statistical values for the 10-year follow-up. Finally, the most discriminative features for Model 3 at both the follow-ups (lower panels of Fig 3A and 3B) are represented by all the features related to lymph nodes. For the 10-year follow-up, other features, such as in situ component, ER and Ki67 reveal to be frequently selected.  of 87.0% and 82.4% for the 10-year follow-up). Table 3 summarizes the percentage median, first and third quartile values of all the evaluation metrics achieved by all the three models on the ten independent tests. The AUC values reached by Model 2 decreased up to around 14%: a median AUC value of 70.5% and 70.7% were obtained for the 5-year and 10-year follow-up, respectively. As emerged from

Performances of the proposed machine learning ensemble approach
Despite being stable between the training and the test sets, the results achieved by Model 1 do not make this model as being utilized in the actual clinical practice. Experimental results shown that a hidden pattern could be identified both by excluding confounding patients (Model 2) and by considering all confounding patients only (Model 3), but their performances show a gap when applied on the test sets. By merging the decisions of the three models, an ensemble approach that reached promising results was obtained, thus enabling its application in the actual clinical practice. Here, the performance achieved by the optimal ensemble model (in terms of AUC value) with a unique combination of parameters valid for all the ten independent test by imposing a percentage median value for the "no answers" given by the model over all the sets of patients as maximum 25% was presented. The percentage median values as well as the first and the third quartile values of all the evaluation metrics for both 5-and 10-year follow-ups are summarized in Table 3: a median AUC value of 77.1% and 76.3%, a median accuracy value of 75.5% and 71.3% were achieved for the 5-year follow-up and the The performances are evaluated in percentage median values of AUC, accuracy, sensitivity, and specificity. The percentage first and third quartile values are also computed and reported in brackets.
https://doi.org/10.1371/journal.pone.0274691.t002 Table 3. Summary of the performances for the 5-and 10-year follow-up achieved by Model 1, Model 2, and Model 3 over the ten independent test sets. 10-year follow-up, respectively. The model was optimized in terms of the AUC metric since it is the only metric not depending on the threshold identified by the computation of Youden's index. Anyway, the ensemble model outperformed all the three baseline models with an overall improvement for all the other evaluation metrics. A direct comparison with the original model, i.e., Model 1, is also reported in the following. The upper panels of Fig 4A and 4B depict the AUC value achieved by the proposed ensemble model over all the ten independent tests separately (orange lines) alongside the AUC values obtained by Model 1 on the same sets (blue lines) for the 5-year follow-up and the 10-year follow-up, respectively. In the lower panels of Fig 4A and 4B, the percentage of the "no answers" returned by the ensemble model over all the independent test sets for the 5-year follow-up and the 10-year follow-up is shown, respectively.

Follow-up Model AUC (%) Accuracy (%) Sensitivity (%) Specificity (%)
The median values of "no answers", which are 23.4% and 23.6% for the 5-year follow-up and 10-year follow-up, respectively, are also reported as black dotted lines. An overall improvement of AUC values over all the independent test sets was obtained, especially for the 5-year follow-up. Here, a direct comparison of the ensemble model with Model 2 and Model 3 was neglected since they were derived from Model 1, which is also the most stable model with respect to the evaluation on the training and test sets.

Discussion
An accurate prediction of breast cancer recurrence and analogous events could aid doctors in making better decisions about adjuvant treatment planning with an improvement in cost reduction and prevention of excessive treatment [33,34]. Therefore, predicting the occurrence of recurrence in term of survival or classification has become a major issue in the current research on breast cancer. The 5-year follow-up is the most common benchmark in the breast cancer research field [16]. However, since breast cancer patients often experience events in a longer time and then longer-term therapy effect (e.g., the effect of hormone therapy) needs to be probed [35], also a 10-year follow-up is also frequently considered. PREDICT and Adjuvant! Online are two of the most popular tools at international level [36,37], which are able to give prediction about the occurrence of recurrence probability for breast cancer patients in terms of survival, as they have been validated on diverse cohort of breast cancer patients in both United States and Western Europe. The subject of IDE prediction, despite being less investigated in the state of the art, is of great interest in the adjuvant clinical trial setting [17]. Treatment-related causes may play a role in the occurrence of second tumors or contralateral breast cancers [18,38]. To date, results are inconclusive and, hence, the prediction of occurrence of composite events, i.e., IDEs, need to be investigated. The IDE prediction has been recently probed for the first time by Fu et al. [19]. They developed a 5-year survival model based on XGBoost which made use of patients' characteristics related to demographics, diagnosis, pathology, and therapy. In this work, we wanted to make a contribution in the field of breast cancer IDE prediction. We developed a novel ensemble machine learning approach, based on the concept of voting among multiple models, which is able to predict in terms of classifications the occurrence of invasive disease events after the primary tumor, such as recurrence, metastasis, contralateral and second tumors at both 5-and 10-years follow-ups. The developed method has revealed to obtain promising performances on ten independent tests for both the follow-ups: a median AUC 77.1% and 76.3%, a median accuracy value of 75.5% and 71.3% were achieved for the 5-year follow-up and the 10-year follow-up, respectively. The method was also able to outperform the original predictive method, which took in input all the raw data of the entire dataset and returned a median AUC value of 68.1% and 68.0%, a median accuracy value of 66.8% and 64.3% for the 5-year follow-up and 10-year follow-up, respectively. A fundamental peculiarity of the proposed model is the ability to identify the so-called confounding patients and exploit them to define a consensus-based model. To the best of our knowledge, the proposed model is the first ensemble model within the field of IDE prediction, thus proposing an innovative angle from which probe and address the IDE prediction task. The concept of voting on which the model is based allow us to obtain better clarifications about the decision made by the classifier than standard machine-learning models, that, even if based on sophisticated mathematical underpinnings, usually share the common trait to fail in explaining in transparent and easy ways how a specific decision is achieved, thus hindering their applicability in clinical practice. The achieved promising results make this work as a first effort towards the implementation of a more intelligible method. However, at this step, the proposed model cannot be implemented in clinical practice yet since it required a validation on a wider cohort of patients, preferably including data collected across multiple centers. Moreover, this study has the limitation to have analyzed a heterogeneous sample population, since the period of the first tumor diagnosis was over 20 years (from 1995 to 2019). During this period, several pharmacological and treatment generations are being succeeded with a great impact on the predicted outcome. As example, the introduction of regimens containing anthracycline and taxane administered sequentially or in combination has resulted in a reduction of 16% of the risk of recurrence [39]. Randomized clinical studies within the adjuvant treatment framework, such as HERA, NSABP B-31, NCCTG N9831 and BCIRG 006, have also demonstrated how the addition of Trastuzumab to the typical therapy schemes has dramatically changed the natural history of HER2/neu+ breast cancer patients [40]. However, the developed model has learned to recognize even the effects of the first-generation drugs such as CMF, which, even if more rarely, are still used in clinical practice alongside new generation drugs. As future works, a wider cohort of patients will be used to evaluate the model generalizability and robustness alongside to the development of a tool that is able to distinguish the specific typology of the invasive disease event (recurrence, contralateral breast cancer and second tumor). The addition of radiomic features extracted by primary tumor diagnostic imaging exams, e.g., ultrasounds or mammograms, could be also investigated [41,42].

Conclusions
The current work presents a novel ensemble machine learning approach that, based on voting among three baseline models and grid search procedure, was able to predict the occurrence of 5-and 10-year invasive disease events in breast cancer patients. When a coherent prediction among the baseline models was obtained, the model returned a specific prediction for a given patient. Conversely, when no consensus among the baseline models was reached, the ensemble model remained unexpressed about that patient. The identification of confounding patients as well as the definition of an ensemble method, that returned a decision only when a consensus among the inner classifiers is reached can be considered as innovative aspects with respect to IDE prediction challenge. In the vein of the bursting concept of explainable artificial intelligence within the biomedical field, this ensemble model is able to return more intelligible choices as it is based on the concept of voting among multiple models. This aspect is particularly important for an easier clinical applicability of a decision support system that, just as the model proposed here, shows promising results. A greater interpretability of the results could lead clinicians to be more prone to adopt medical artificial intelligence methodologies in the actual clinical practice.  Table. Summary of the parameters appearing in the rules of the ensemble model. For all the parameters, a brief description is reported in the second column. The range of variation represents all the possible values that a specific parameter could assume, whilst the step indicates the distance between two subsequent values within that range. The parameters th1, th2 and th3 do not require a range of variation since they were automatically computed on each of the ten training sets separately, as reported in the second column. Except for these three last parameters, the other eight parameters can assume five possible values and their optimal parameter combination under set conditions was identified by implementing the grid search procedure. Since these parameters were computed for the ten training sets separately, for each of them, a distribution was obtained, and the quantiles of certain order were computed. The extremes of the respective ranges were obtained by averaging the quantiles of specific order over the ten training sets. The number following the word quantile indicates the order of that quantile.